Using SAS/STAT to implement a multivariate adaptive outlier detection approach to distinguish outliers from extreme values
Abstract
Hawkins (1980) defines an outlier as “an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism”. To identify outliers, a classic multivariate detection approach applies the robust Mahalanobis distance method and splits the distribution of distance values into two subsets, within-the-norm and out-of-the-norm. The threshold is usually set to the 97.5% quantile of the chi-square distribution with p degrees of freedom, where p is the number of variables, and items whose distance values exceed it are labeled out-of-the-norm. This threshold is arbitrary, however, and it may flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. It is therefore desirable to identify an additional threshold, a cutoff point that divides the set of out-of-the-norm points into two subsets: extreme values and outliers. One way to do this, particularly for larger databases, is to increase the threshold to another arbitrary number, but this approach must take the size of the dataset into account, since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Analysis) proposes “an adaptive threshold that increases with the number of items n if the data is clean but it remains bounded if there are outliers in the data.” In 2005, Filzmoser, Garrett and Reimann (Computers & Geosciences) built on Gervini’s contribution and derived, by simulation, a relationship between the number of items n, the number of variables p, and a critical ancillary variable used to determine outlier thresholds. This paper implements the Gervini adaptive threshold value estimator using PROC ROBUSTREG and the SAS chi-square functions CINV and PROBCHI, available in the SAS/STAT environment.
It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
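The thresholding logic summarized in the abstract can be sketched outside SAS. The Python below is an illustrative sketch only, not the paper's SAS implementation. It assumes the squared robust Mahalanobis distances have already been computed (for instance from a robust fit such as the one PROC ROBUSTREG reports), and it takes the simulation-derived critical value as a caller-supplied parameter `p_crit` rather than hard-coding any formula. The function names, the midpoint convention for the empirical distribution function, and the rule for reading the cutoff off the sorted distances are all assumptions made for this example. `chi2_cdf` and `chi2_quantile` stand in for PROBCHI and CINV.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF (the role PROBCHI plays in SAS), via the power series
    for the regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    if z > a + 500.0:  # far upper tail: treat as 1 (assumes modest df)
        return 1.0
    term, total, k = 1.0, 1.0, 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return min(1.0, math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total)


def chi2_quantile(prob, df):
    """Chi-square quantile (the role CINV plays in SAS), found by bisection."""
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return hi


def adaptive_threshold(d2, p, p_crit, alpha=0.025):
    """Return (delta, alpha_n, cutoff) for squared robust distances d2.

    delta   : fixed chi-square quantile (97.5% by default)
    alpha_n : largest positive gap between the chi-square CDF and the
              empirical CDF beyond delta, set to 0 if not above p_crit
    cutoff  : adaptive threshold; infinite when the data look clean
    """
    n = len(d2)
    delta = chi2_quantile(1.0 - alpha, p)
    d2_sorted = sorted(d2)
    gap = 0.0
    for i, u in enumerate(d2_sorted):
        if u >= delta:
            # Midpoint empirical CDF (i + 0.5) / n is an assumed convention.
            gap = max(gap, chi2_cdf(u, p) - (i + 0.5) / n)
    alpha_n = gap if gap > p_crit else 0.0
    if alpha_n == 0.0:
        return delta, alpha_n, math.inf
    k = max(n - math.ceil(n * alpha_n), 1)  # 1-based rank in sorted distances
    return delta, alpha_n, max(d2_sorted[k - 1], delta)


def classify(d2, delta, cutoff):
    """Label each item as within-the-norm, extreme value, or outlier."""
    labels = []
    for u in d2:
        if u <= delta:
            labels.append("within-the-norm")
        elif u <= cutoff:
            labels.append("extreme value")
        else:
            labels.append("outlier")
    return labels
```

With clean data the gap stays below `p_crit`, the cutoff is infinite, and every point beyond the 97.5% quantile is reported as an extreme value. With contaminated data the cutoff stays finite, so only points beyond it are labeled outliers.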
Similar resources
Identification of outliers types in multivariate time series using genetic algorithm
Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to incorrect modeling, biased parameter estimates, and inaccurate predictions. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...
Multivariate outlier detection in exploration geochemistry
A new method for multivariate outlier detection is presented that can distinguish between extreme values of a normal distribution and values originating from a different distribution (outliers). To facilitate visualising multivariate outliers spatially on a map, the multivariate outlier plot is introduced. In this plot different symbols refer to a distance measure from the centre of the distrib...
Outlier Detection in Wireless Sensor Networks Using Distributed Principal Component Analysis
Detecting anomalies is an important challenge for intrusion detection and fault diagnosis in wireless sensor networks (WSNs). To address the problem of outlier detection in wireless sensor networks, in this paper we present a PCA-based centralized approach and a DPCA-based distributed energy-efficient approach for detecting outliers in sensed data in a WSN. The outliers in sensed data can be ca...
Local multivariate outliers as geochemical anomaly halos indicators, a case study: Hamich area, Southern Khorasan, Iran
Anomaly recognition has always been a prominent subject in preliminary geochemical explorations. Regional geochemical data processing draws on a range of statistical and data mining techniques, as well as different mapping methods that serve to present the outputs. Outlier values are of interest in investigations where data are gathered under controlled condition...
The Art of Data Visualization: Detecting Multivariate Data Outliers Using an Interactive Approach
Successfully detecting outliers in multivariate data requires statistical and programming skills and can be very time consuming. Requests for outlier detection can come from different skill groups, so it is more efficient and effective to allow users to interact directly with the data themselves. We have developed an interactive, web-based data visualization application for outlier detec...